Introduction Large language models (LLMs) are a promising generative artificial intelligence tool that oncologists can leverage in written physician-patient communication. Rapid evolution in LLM reasoning ability has produced a tiered market in which free and paid models differ substantially in capability. Chimeric antigen receptor T-cell (CAR-T) therapy is a relatively nascent treatment modality, and studies of CAR-T-related patient education materials (PEM) have not yet evaluated LLMs on readability and content quality metrics. This study evaluates 5 LLMs – GPT-4.1 mini and GPT-4o (OpenAI), Gemini 2.5 Flash and Gemini 2.5 Pro (Google), and Claude Sonnet 4 (Anthropic) – for response readability and quality compared with existing PEM website articles.

Methods The 20 most common patient questions related to an initial query of “CAR-T Therapy” were generated from the Google Search People Also Ask algorithm (Google). The website articles the algorithm linked to these questions formed the existing PEM group. Each LLM was prompted with the same 20 questions, yielding one 20-response set per model. For readability, we used the Flesch Reading Ease (FRE) score, which ranges from 0 to 100, with 100 indicating the most easily read text. For content quality, we used the validated Brief DISCERN (BD) instrument; BD scores range from 0 to 30, with scores ≥16 indicating good-quality, evidence-based content. Mean FRE and BD scores were calculated for each group's response set and compared to existing PEM using one-way ANOVA with Dunnett's post-hoc testing. The LLMs were then grouped into paid (GPT-4o, Gemini 2.5 Pro) and free (GPT-4.1 mini, Gemini 2.5 Flash, Claude Sonnet 4) tiers for a secondary analysis against the existing PEM group; mean FRE and BD scores for the 3 groups were compared using one-way ANOVA with Tukey's post-hoc testing. Statistical analysis was performed in GraphPad Prism 10 (GraphPad Software).
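For readers who wish to replicate this style of analysis outside GraphPad Prism, a minimal sketch in Python is shown below. It assumes SciPy >= 1.11 (which provides stats.dunnett); the FRE formula is the standard Flesch (1948) equation, and the score lists are illustrative placeholders rather than study data (BD is scored by human raters and is not computed programmatically).

    from scipy import stats

    def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
        # Standard Flesch Reading Ease formula (Flesch, 1948); higher
        # scores (toward 100) indicate easier-to-read text.
        return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

    print(flesch_reading_ease(words=100, sentences=5, syllables=150))  # ~59.6

    # Illustrative per-response FRE scores for three of the six groups
    # (placeholders only; the study used 20 scores per group).
    pem   = [52.0, 41.0, 48.0, 55.0, 39.0, 46.0]
    gpt4o = [44.0, 36.0, 41.0, 38.0, 45.0, 35.0]
    flash = [27.0, 24.0, 29.0, 22.0, 26.0, 28.0]

    # Primary analysis: one-way ANOVA across all groups, then Dunnett's
    # post-hoc test with the existing PEM group as the control.
    f_stat, p_val = stats.f_oneway(pem, gpt4o, flash)
    dunnett = stats.dunnett(gpt4o, flash, control=pem)
    print(f_stat, p_val, dunnett.pvalue)

    # Secondary analysis (PEM vs paid vs free tiers): Tukey's HSD compares
    # every pair of groups rather than each group against a control.
    tukey = stats.tukey_hsd(pem, gpt4o, flash)
    print(tukey)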

Results Mean (SD) FRE was 46.9 (13.1) for PEM, 33.9 (15.9) for GPT-4.1 mini, 25.8 (7.6) for Gemini 2.5 Flash, 22.0 (13.7) for Claude Sonnet 4, 39.9 (14.0) for GPT-4o, and 37.3 (8.3) for Gemini 2.5 Pro. One-way ANOVA showed a significant difference in mean FRE between groups, F(5,114) = 10.86, p < 0.001. Dunnett's post-hoc tests showed that, compared to PEM, mean FRE was significantly lower for GPT-4.1 mini (46.9 vs 33.9, p = 0.006), Gemini 2.5 Flash (46.9 vs 25.8, p < 0.001), and Claude Sonnet 4 (46.9 vs 22.0, p < 0.001). No difference in mean FRE versus PEM was found for GPT-4o (46.9 vs 39.9, p = 0.27) or Gemini 2.5 Pro (46.9 vs 37.3, p = 0.07). Mean (SD) BD was 20.5 (6.0) for PEM, 9.7 (1.8) for GPT-4.1 mini, 14.2 (3.5) for Gemini 2.5 Flash, 13.8 (3.7) for Claude Sonnet 4, 10.0 (2.0) for GPT-4o, and 15.8 (2.2) for Gemini 2.5 Pro. One-way ANOVA indicated a significant difference in mean BD between groups, F(5,114) = 26.43, p < 0.001. Dunnett's post-hoc tests showed that all 5 LLMs had significantly lower BD than PEM: GPT-4.1 mini (20.5 vs 9.7, p < 0.001), Gemini 2.5 Flash (20.5 vs 14.2, p < 0.001), Claude Sonnet 4 (20.5 vs 13.8, p < 0.001), GPT-4o (20.5 vs 10.0, p < 0.001), and Gemini 2.5 Pro (20.5 vs 15.8, p < 0.001).
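As a consistency check, the reported F-statistic can be reconstructed from the summary statistics alone (the six group means and SDs above, with n = 20 responses per group); a sketch follows for the FRE comparison.

    import numpy as np
    from scipy import stats

    # Reported FRE group summaries (PEM + 5 LLMs), n = 20 per group.
    means = np.array([46.9, 33.9, 25.8, 22.0, 39.9, 37.3])
    sds   = np.array([13.1, 15.9,  7.6, 13.7, 14.0,  8.3])
    n, k = 20, len(means)

    # With equal group sizes, the one-way ANOVA sums of squares follow
    # directly from the group means and SDs.
    grand_mean = means.mean()
    ss_between = n * ((means - grand_mean) ** 2).sum()
    ss_within  = (n - 1) * (sds ** 2).sum()
    df_between, df_within = k - 1, k * n - k  # 5 and 114

    f = (ss_between / df_between) / (ss_within / df_within)
    p = stats.f.sf(f, df_between, df_within)
    print(f"F({df_between},{df_within}) = {f:.2f}, p = {p:.1e}")
    # Prints F(5,114) = 10.86, matching the reported value, with p < 0.001.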

In the secondary analysis of paid and free LLM tiers versus PEM, mean (SD) FRE was 46.9 (13.1) for PEM, 38.6 (11.4) for paid LLMs, and 27.2 (13.6) for free LLMs. One-way ANOVA showed a significant difference in mean FRE between groups, F(2,117) = 20.93, p < 0.001. Tukey's post-hoc tests revealed significantly higher mean FRE for paid versus free LLMs (38.6 vs 27.2, p < 0.001) and for PEM versus free LLMs (46.9 vs 27.2, p < 0.001); mean FRE did not differ significantly between paid LLMs and PEM. Mean (SD) BD was 20.5 (6.0) for PEM, 12.9 (3.6) for paid LLMs, and 12.5 (3.7) for free LLMs. One-way ANOVA showed a significant difference in mean BD between groups, F(2,117) = 30.15, p < 0.001. On Tukey's post-hoc testing, BD did not differ significantly between paid and free LLMs; however, each tier had significantly lower BD than PEM: paid (12.9 vs 20.5, p < 0.001) and free (12.5 vs 20.5, p < 0.001).

Conclusions All 5 LLMs evaluated produced lower-quality responses to patient questions about CAR-T therapy than existing PEM websites, and paid LLM tiers did not outperform free tiers on this metric. Readability did not differ significantly between paid LLM tiers and existing PEM, whereas free LLMs were significantly less readable than existing PEM. Overall, because only the costly paid tiers matched existing PEM in readability, our results highlight a potential health disparity in LLM-based patient education.
